information-theoretic confidence bound
Information-Theoretic Confidence Bounds for Reinforcement Learning
We integrate information-theoretic concepts into the design and analysis of optimistic algorithms and Thompson sampling. By making a connection between information-theoretic quantities and confidence bounds, we obtain results that relate the per-period performance of the agent with its information gain about the environment, thus explicitly characterizing the exploration-exploitation tradeoff. The resulting cumulative regret bound depends on the agent's uncertainty over the environment and quantifies the value of prior information. We show applicability of this approach to several environments, including linear bandits, tabular MDPs, and factored MDPs. These examples demonstrate the potential of a general information-theoretic approach for the design and analysis of reinforcement learning algorithms.
Reviews: Information-Theoretic Confidence Bounds for Reinforcement Learning
It would be great to make it more accessible to a more general audience, as the ideas it contains are fairly intuitive at their core. One suggestion would be to include illustrative figures to convey the general intuition, for example for the case of the linear-Gaussian bandit, since the confidence sets have a natural geometric interpretation in terms of the variance of the posterior. The analyses of the examples given (linear bandits, MDPs, factored MDPs) essentially all follow a recipe made possible by the results relating the confidence bounds to the regret. Specifically, they are: 1) Construct a confidence interval based on the mutual information using the characteristics of the problem at hand (linearity/Gaussian noise assumptions for the bandit, specific forms of the prior for MDPs) 2) Bound the sum of the information gain. Combining these two then gives a regret bound.
Reviews: Information-Theoretic Confidence Bounds for Reinforcement Learning
The paper extends Russo and Van Roy (JMRL2016) work to provide information-theoretical analysis of Thompson sampling and UCB-like algorithms in more general setting. The three reviewers acknowledge the contributions, and the potential impact of connecting information-theoretical concepts to the design of algorithms. Reviewers have suggested ways to improve the manuscript. The authors should follow these directions, and in particular fix notations, include simulation results, and provide explanations about proofs when necessary. The contributions in this paper are "methodological", i.e., it proposes a framework to analyze the regret of certain algorithms.
Information-Theoretic Confidence Bounds for Reinforcement Learning
We integrate information-theoretic concepts into the design and analysis of optimistic algorithms and Thompson sampling. By making a connection between information-theoretic quantities and confidence bounds, we obtain results that relate the per-period performance of the agent with its information gain about the environment, thus explicitly characterizing the exploration-exploitation tradeoff. The resulting cumulative regret bound depends on the agent's uncertainty over the environment and quantifies the value of prior information. We show applicability of this approach to several environments, including linear bandits, tabular MDPs, and factored MDPs. These examples demonstrate the potential of a general information-theoretic approach for the design and analysis of reinforcement learning algorithms.
Information-Theoretic Confidence Bounds for Reinforcement Learning
Lu, Xiuyuan, Roy, Benjamin Van
We integrate information-theoretic concepts into the design and analysis of optimistic algorithms and Thompson sampling. By making a connection between information-theoretic quantities and confidence bounds, we obtain results that relate the per-period performance of the agent with its information gain about the environment, thus explicitly characterizing the exploration-exploitation tradeoff. The resulting cumulative regret bound depends on the agent's uncertainty over the environment and quantifies the value of prior information. We show applicability of this approach to several environments, including linear bandits, tabular MDPs, and factored MDPs. These examples demonstrate the potential of a general information-theoretic approach for the design and analysis of reinforcement learning algorithms.